Visual7W: Grounded Question Answering in Images
We have seen great progress in basic perceptual tasks such as object
recognition and detection. However, AI models still fail to match humans in
high-level vision tasks due to the lack of capacities for deeper reasoning.
Recently the new task of visual question answering (QA) has been proposed to
evaluate a model's capacity for deep image understanding. Previous works have
established a loose, global association between QA sentences and images.
However, many questions and answers, in practice, relate to local regions in
the images. We establish a semantic link between textual descriptions and image
regions by object-level grounding. It enables a new type of QA with visual
answers, in addition to textual answers used in previous work. We study the
visual QA tasks in a grounded setting with a large collection of 7W
multiple-choice QA pairs. Furthermore, we evaluate human performance and
several baseline models on the QA tasks. Finally, we propose a novel LSTM model
with spatial attention to tackle the 7W QA tasks.
Comment: CVPR 201
Action Recognition by Hierarchical Mid-level Action Elements
Realistic videos of human actions exhibit rich spatiotemporal structures at
multiple levels of granularity: an action can always be decomposed into
multiple finer-grained elements in both space and time. To capture this
intuition, we propose to represent videos by a hierarchy of mid-level action
elements (MAEs), where each MAE corresponds to an action-related spatiotemporal
segment in the video. We introduce an unsupervised method to generate this
representation from videos. Our method is capable of distinguishing
action-related segments from background segments and representing actions at
multiple spatiotemporal resolutions. Given a set of spatiotemporal segments
generated from the training data, we introduce a discriminative clustering
algorithm that automatically discovers MAEs at multiple levels of granularity.
We develop structured models that capture a rich set of spatial, temporal and
hierarchical relations among the segments, where the action label and multiple
levels of MAE labels are jointly inferred. The proposed model achieves
state-of-the-art performance in multiple action recognition benchmarks.
Moreover, we demonstrate the effectiveness of our model in real-world
applications such as action recognition in large-scale untrimmed videos and
action parsing.
Scene Graph Generation by Iterative Message Passing
Understanding a visual scene goes beyond recognizing individual objects in
isolation. Relationships between objects also constitute rich semantic
information about the scene. In this work, we explicitly model the objects and
their relationships using scene graphs, a visually-grounded graphical structure
of an image. We propose a novel end-to-end model that generates such structured
scene representation from an input image. The model solves the scene graph
inference problem using standard RNNs and learns to iteratively improve its
predictions via message passing. Our joint inference model can take advantage
of contextual cues to make better predictions on objects and their
relationships. The experiments show that our model significantly outperforms
previous methods for generating scene graphs using the Visual Genome dataset and
inferring support relations with the NYU Depth v2 dataset.
Comment: CVPR 201
Learning Generalizable Manipulation Policies with Object-Centric 3D Representations
We introduce GROOT, an imitation learning method for learning robust policies
with object-centric and 3D priors. GROOT builds policies that generalize beyond
their initial training conditions for vision-based manipulation. It constructs
object-centric 3D representations that are robust to background changes and
camera views and reasons over these representations using a transformer-based
policy. Furthermore, we introduce a segmentation correspondence model that
allows policies to generalize to new objects at test time. Through
comprehensive experiments, we validate the robustness of GROOT policies against
perceptual variations in simulated and real-world environments. GROOT
excels at generalizing over background changes, camera viewpoint
shifts, and the presence of new object instances, whereas both state-of-the-art
end-to-end learning methods and object proposal-based approaches fall short. We
also extensively evaluate GROOT policies on real robots, where we demonstrate
their efficacy under drastic changes in setup. More videos and model details
can be found in the appendix and the project website:
https://ut-austin-rpl.github.io/GROOT.
Comment: Accepted at the 7th Annual Conference on Robot Learning (CoRL), 2023
in Atlanta, U
Doduo: Learning Dense Visual Correspondence from Unsupervised Semantic-Aware Flow
Dense visual correspondence plays a vital role in robotic perception. This
work focuses on establishing the dense correspondence between a pair of images
that capture dynamic scenes undergoing substantial transformations. We
introduce Doduo to learn general dense visual correspondence from in-the-wild
images and videos without ground truth supervision. Given a pair of images, it
estimates the dense flow field encoding the displacement of each pixel in one
image to its corresponding pixel in the other image. Doduo uses flow-based
warping to acquire supervisory signals for the training. Incorporating semantic
priors with self-supervised flow training, Doduo produces accurate dense
correspondence robust to the dynamic changes of the scenes. Trained on an
in-the-wild video dataset, Doduo demonstrates superior performance on
point-level correspondence estimation over existing self-supervised
correspondence learning baselines. We also apply Doduo to articulation
estimation and zero-shot goal-conditioned manipulation, underlining its
practical applications in robotics. Code and additional visualizations are
available at https://ut-austin-rpl.github.io/Doduo
Comment: Project website: https://ut-austin-rpl.github.io/Dodu
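The Doduo abstract above describes a dense flow field that stores a per-pixel displacement and says that flow-based warping supplies the training signal. The following is a minimal illustrative sketch of that general idea, not the authors' implementation. The function names (`warp_with_flow`, `photometric_loss`), the `(dx, dy)` flow convention, and the use of nearest-neighbor sampling with a mean absolute difference are all assumptions made for illustration. A trainable model would typically use differentiable bilinear sampling instead.

```python
def warp_with_flow(target, flow):
    """Backward-warp `target` into the source frame (hypothetical helper).

    Assumed convention: flow[y][x] = (dx, dy) means source pixel (x, y)
    corresponds to target pixel (x + dx, y + dy). Pixels whose match falls
    outside the image are left as None.
    """
    h, w = len(target), len(target[0])
    warped = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            tx, ty = round(x + dx), round(y + dy)  # nearest-neighbor lookup
            if 0 <= tx < w and 0 <= ty < h:
                warped[y][x] = target[ty][tx]
    return warped


def photometric_loss(source, target, flow):
    """Mean absolute difference between source and warped target.

    Only pixels with an in-bounds match contribute. A lower value means the
    flow explains the image pair better, which is the kind of signal that
    can be used without ground-truth correspondence labels.
    """
    warped = warp_with_flow(target, flow)
    diffs = [
        abs(s - t)
        for src_row, warp_row in zip(source, warped)
        for s, t in zip(src_row, warp_row)
        if t is not None
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0


# Toy grayscale pair: the target is the source shifted one pixel to the right
# (with wrap-around at the left edge).
source = [[4 * y + x for x in range(4)] for y in range(4)]
target = [[row[(x - 1) % 4] for x in range(4)] for row in source]

true_flow = [[(1, 0)] * 4 for _ in range(4)]  # every pixel moved right by one
zero_flow = [[(0, 0)] * 4 for _ in range(4)]  # "nothing moved" hypothesis

loss_true = photometric_loss(source, target, true_flow)
loss_zero = photometric_loss(source, target, zero_flow)
```

In this toy pair the correct flow gives a lower loss than the zero flow, so minimizing the warping error favors the right correspondence.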
Neural Task Programming: Learning to Generalize Across Hierarchical Tasks
In this work, we propose a novel robot learning framework called Neural Task
Programming (NTP), which bridges the ideas of few-shot learning from
demonstration and neural program induction. NTP takes as input a task
specification (e.g., video demonstration of a task) and recursively decomposes
it into finer sub-task specifications. These specifications are fed to a
hierarchical neural program, where bottom-level programs are callable
subroutines that interact with the environment. We validate our method in three
robot manipulation tasks. NTP achieves strong generalization across sequential
tasks that exhibit hierarchical and compositional structures. The experimental
results show that NTP learns to generalize well towards unseen tasks with
increasing lengths, variable topologies, and changing objectives.
Comment: ICRA 201
MUTEX: Learning Unified Policies from Multimodal Task Specifications
Humans use different modalities, such as speech, text, images, videos, etc.,
to communicate their intent and goals with teammates. For robots to become
better assistants, we aim to endow them with the ability to follow instructions
and understand tasks specified by their human partners. Most robotic policy
learning methods have focused on one single modality of task specification
while ignoring the rich cross-modal information. We present MUTEX, a unified
approach to policy learning from multimodal task specifications. It trains a
transformer-based architecture to facilitate cross-modal reasoning, combining
masked modeling and cross-modal matching objectives in a two-stage training
procedure. After training, MUTEX can follow a task specification in any of the
six learned modalities (video demonstrations, goal images, text goal
descriptions, text instructions, speech goal descriptions, and speech
instructions) or a combination of them. We systematically evaluate the benefits
of MUTEX in a newly designed dataset with 100 tasks in simulation and 50 tasks
in the real world, annotated with multiple instances of task specifications in
different modalities, and observe improved performance over methods trained
specifically for any single modality. More information at
https://ut-austin-rpl.github.io/MUTEX/
Comment: Accepted at 7th Conference on Robot Learning (CoRL 2023), Atlanta,
US